Method and system for calculating term-document importance

ABSTRACT

A weighting system for calculating the term-document importance for each term within each document that is part of a collection of documents (i.e., a corpus). The weighting system calculates the importance of a term within a document based on a computed normalized term frequency and a computed inverse document frequency. The computed normalized term frequency is a function, referred to as the “computed normalized term frequency function” (“A”), of a normalized term frequency. The normalized term frequency is the term frequency, which is the number of times that the term occurs in the document, normalized by the total term frequency of the term within all documents, which is the total number of times that the term occurs in all the documents. The weighting system normalizes the term frequency by dividing the term frequency by a function, referred to as the “normalizing term frequency function” (“Γ”), of the total term frequency. The computed inverse document frequency is a function, referred to as the “computed inverse document frequency function” (“B”), of the inverse document frequency. The weighting system identifies a computed normalized term frequency function A and a computed inverse document frequency function B so that on average the computed normalized term frequency and the computed inverse document frequency contribute equally to the weight of the terms.

CROSS-REFERENCE TO RELATED APPLICATION

This application claims the benefit of pending U.S. Provisional Application No. 60/103,718, filed Oct. 9, 1998, which application is incorporated by reference in its entirety.

TECHNICAL FIELD

This invention relates generally to a computer system for searching for documents, and more specifically to calculating the term-document importance of terms within a collection of documents.

BACKGROUND OF THE INVENTION

The Internet is a collection of interconnected computer systems through which users can access a vast store of information. The information accessible through the Internet is stored in electronic files (i.e., documents) under control of the interconnected computer systems. It has been estimated that over 50 million documents are currently accessible through the Internet and that the number of documents is growing at the rate of 75% per year. Although a wealth of information is stored in these documents, it has been very difficult for users to locate documents relating to a subject of interest. The difficulty arises because documents are stored in many different computer systems, and the Internet provides no central mechanism for registering documents. Thus, a user may not even know of the existence of certain documents, let alone the subject matter of the documents. Each document that is accessible through the Internet is assigned a unique identifier, which is referred to as a uniform resource locator (“URL”). Once a user knows the identifier of a document, the user can access the document. However, even if a user knows the identifiers of all documents accessible through the Internet, the user may not know the subject matter of the documents. Thus, the user may have no practical way to locate a document relating to a subject of interest.

Several search engines have been developed to assist users in locating documents relating to a subject of interest. Search engines attempt to locate and index as many of the documents provided by as many computer systems of the Internet as possible. The search engines index the documents by mapping terms that represent the subject matter of each document to the identifier of the document. When using a search engine to locate documents relating to a subject of interest, the user enters search terms that describe the subject of interest. The search engine then searches the index to identify those documents that are most relevant to the search terms. In addition, the search engine may present the search results, that is, the list of relevant documents, to the user in order based on the relevance to the search terms. The user can then select and display the most relevant documents.

The accuracy of the search results depends upon the accuracy of the indexing used by a search engine. Unfortunately, there is no easy way for a search engine to determine accurately the subject matter of documents. The difficulty in determining the subject matter of a document is compounded by the wide variety of formats (e.g., a word processing document or a hypertext document) and the complexity of the formats of the documents accessible through the Internet. To make it easier for a search engine to determine the subject matter of a document, some document formats have a “keyword” section that provides words that are representative of the subject matter of the document. Unfortunately, creators of documents often fill the “keyword” section with words that do not accurately represent the subject matter of the document, a practice referred to as “false promoting” or “spamming.” For example, a creator of a classified advertising web page for automobiles may fill the “keyword” section with repetitions of the word “car.” The creator does this so that a search engine will identify that web page as very relevant whenever a user searches for the term “car.” However, a “keyword” section that more accurately represents the subject matter of the web page may include the words “automobile,” “car,” “classified,” “for,” and “sale.”

Because the document formats have no reliable way to identify the subject matter of a document, search engines use various algorithms to determine the actual subject matter of documents. Such algorithms may generate a numerical value for each term in a document that rates the importance of the term within the document. For example, if the term “car” occurs in a document more times than any other term, then the algorithm may give a high numerical value to the term “car” for that document. Typical algorithms used to rate the importance of a term within a document often factor in not only the frequency of occurrence of the term within the document, but also the number of documents that contain that term. For example, if a term occurs two times in a certain document and also occurs in many other documents, then the importance of that term to the document may be relatively low. However, if the term occurs two times in that document, but occurs in no other documents, then the importance of that term within the document may be relatively high even though the term occurs only two times in the document. In general, these algorithms attempt to provide a high “information score” to the terms that best represent the subject matter of a document with respect to both the document itself and the collection of documents.

To calculate the importance or “information score,” typical algorithms take into consideration what is referred to as the term frequency within a document and the document frequency. The term frequency of a term is the number of times that the term occurs in the document. The term frequency for term i within document j is represented as TF_(ij). The document frequency of a term is the number of documents in which the term occurs. The document frequency for term i is represented as n_(i). One such algorithm uses the Salton Buckley formula for calculating the importance of terms. The formula is given by the following equation:

$$W_{ij} = \log_{2}\mathrm{TF}_{ij} * \log_{2}\frac{N}{n_{i}} \qquad (1)$$

where W_(ij) is the numerical value (i.e., weight) of the importance of the term i to the document j, where TF_(ij) is the term frequency, where n_(i) is the document frequency, and where N is the total number of documents in a collection of documents. The quotient N/n_(i) is referred to as the inverse document frequency, which is the inverse of the ratio of the number of documents that contain the term to the total number of documents. As the term frequency increases, the weight calculated by this formula increases logarithmically. That is, as the term occurs more frequently in a document, the weight of that term within the document increases. Also, as the document frequency increases, the weight decreases logarithmically. That is, as a term occurs in more documents, the weight of the term decreases. It is, of course, desirable to use a formula that results in weights that most accurately reflect the importance or information score of terms.
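As an illustration of equation (1), the following is a minimal sketch rather than a definitive implementation. The function name `salton_buckley_weight` and the choice to return 0 for a term that does not occur in the document are assumptions made for this example only.

```python
import math

def salton_buckley_weight(tf, n_docs_with_term, n_docs):
    """Equation (1): W_ij = log2(TF_ij) * log2(N / n_i)."""
    if tf <= 0 or n_docs_with_term <= 0:
        return 0.0  # assumption: an absent term contributes no weight
    return math.log2(tf) * math.log2(n_docs / n_docs_with_term)

# A term that occurs 8 times in one document of a 1024-document corpus.
rare = salton_buckley_weight(8, 4, 1024)      # log2(8) * log2(256)
common = salton_buckley_weight(8, 512, 1024)  # log2(8) * log2(2)
print(rare, common)
```

With the same term frequency, the term that occurs in fewer documents receives the larger weight, which is the behavior described above.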

SUMMARY OF THE INVENTION

An embodiment of the present invention provides a weighting system for calculating the weight for a term within one document in a collection of documents. The weighting system first generates a term frequency that represents the number of times that the term occurs in the one document. The weighting system also generates a total term frequency that represents a total number of times the term occurs in the collection of documents. The weighting system then calculates a normalized term frequency by factoring the generated term frequency by a normalizing function of the generated total term frequency. The weighting system combines the calculated normalized term frequency with a document frequency to generate the weight for the term. In one embodiment, the normalizing function is a function that is based on the square root of the generated total term frequency. In another embodiment, the normalizing function is a function that is based on a logarithm of the generated total term frequency. The weighting system also may use various different algorithms for generating an improved term frequency that more accurately represents the importance of a term. In one embodiment, the weighting system uses various factors, such as the formatting (e.g., italics) of a term and the number of unique terms within the document, to generate the improved term frequency.

In another embodiment, the weighting system identifies a formula for weighting terms within a collection of documents. The weighting system generates an average term frequency that represents an average of term frequencies for each term within each document. The term frequency is the number of times that a term occurs in a document. The weighting system then generates an average inverse document frequency that represents an average of inverse document frequencies for each term. The inverse document frequency of a term is the number of documents in the collection divided by the number of documents in which the term occurs. The weighting system then identifies a first function of the generated average term frequency and a second function of the generated average inverse document frequency so that the first function of the generated average term frequency is approximately equal to the second function of the generated average inverse document frequency. In one embodiment, the first and second functions are logarithmic, and the weighting system identifies bases for each function to achieve the equality.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a computer system for executing the weighting system.

FIG. 2 is a flow diagram of the weighting component.

FIG. 3 is a flow diagram of a routine to calculate the weight of a term.

FIG. 4 is a flow diagram of the document selector component.

FIG. 5 is a flow diagram of a routine to calculate the bases for the weighting function.

DETAILED DESCRIPTION OF THE INVENTION

An embodiment of the present invention provides a weighting system for calculating the term-document importance for each term within each document that is part of a collection of documents (i.e., a corpus). The weighting system calculates the importance of a term within a document based on a computed normalized term frequency and a computed inverse document frequency. The computed normalized term frequency is a function, referred to as the “computed normalized term frequency function” (“A”), of a normalized term frequency. The normalized term frequency is the term frequency, which is the number of times that the term occurs in the document, normalized by the total term frequency of the term within all documents, which is the total number of times that the term occurs in all the documents. The weighting system normalizes the term frequency by dividing the term frequency by a function, referred to as the “normalizing term frequency function” (“Γ”), of the total term frequency. The computed inverse document frequency is a function, referred to as the “computed inverse document frequency function” (“B”), of the inverse document frequency. Thus, the importance of a term within a document is represented by the following equation:

$$W_{ij} = A\left(\frac{\mathrm{TF}_{ij}}{\Gamma\left(\mathrm{TF}_{i}\right)}\right) * B\left(\frac{N}{n_{i}}\right)$$

where W_(ij) represents the importance of term i within document j, where TF_(ij) represents the term frequency for term i within document j, where TF_(i) represents the total term frequency for term i, where n_(i) represents the number of documents that contain term i, and where N represents the number of documents in the corpus. The various functions used by the weighting system are selected to improve the accuracy of the importance calculation.

In one embodiment, the normalizing term frequency function Γ is based on the square root of the total term frequency. Therefore, as a term increasingly occurs in the corpus, the influence of the term frequency in the importance calculation is reduced by a factor that is the square root of the number of occurrences. That is, if the term occurs 16 times throughout the corpus, then the term frequency is reduced by a factor of 4, which is the square root of 16. However, if the term occurs 64 times throughout the corpus, then the factor is 8, which is the square root of 64. In an alternate embodiment, the normalizing term frequency function Γ is a logarithmic function of the total term frequency. Therefore, as a term increasingly occurs in the corpus, the influence of the term frequency in the importance calculation is reduced by a factor that is a logarithm of the total term frequency. That is, if the term occurs 16 times throughout the corpus, then the term frequency is reduced by a factor of 4, which is the logarithm (base 2) of 16. However, if the term occurs 64 times throughout the corpus, then the factor is 6, which is the logarithm (base 2) of 64. In one embodiment, the weighting system uses a logarithmic function for both the computed normalized term frequency function A and the computed inverse document frequency function B.
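The factors quoted in this paragraph can be checked with a short sketch. The helper names `sqrt_factor` and `log2_factor` are hypothetical, and the functions are deliberately simplified: they apply a plain square root and a plain base-2 logarithm, without any offset terms.

```python
import math

def sqrt_factor(total_tf):
    # Simplified square-root normalizing factor (no offset term).
    return math.sqrt(total_tf)

def log2_factor(total_tf):
    # Simplified base-2 logarithmic normalizing factor (no offset term).
    return math.log2(total_tf)

for total_tf in (16, 64):
    print(total_tf, sqrt_factor(total_tf), log2_factor(total_tf))
```

The two factors agree at 16 occurrences and diverge at 64, where the square root divides the term frequency by 8 and the logarithm by 6.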

A goal of the weighting system is to derive the term-document importance by giving equal weight to the computed normalized term frequency and the computed inverse document frequency. To achieve this equal weighting, the weighting system in one embodiment uses different bases for the logarithm of the computed normalized term frequency function A and the computed inverse document frequency function B. Traditional algorithms have used the same base for the logarithmic functions of the term frequency and the inverse document frequency. However, with such traditional logarithmic functions, the term frequency and inverse document frequency do not in general contribute equally to the calculation of importance. For example, if a corpus contains closely related documents, then most terms will occur in most of the documents and the influence of inverse document frequency will be about the same for each term. As a result, the influence of term frequency on the calculation of the importance will be much greater than that of the inverse document frequency. To equalize the influence, the weighting system calculates the bases for the computed normalized term frequency function A and the computed inverse document frequency function B based on the average term frequencies and the average document frequencies throughout the corpus. By computing the bases for the logarithms, the weighting system ensures that on average, the computed normalized term frequency and the computed inverse document frequency contribute equally to the calculated importance of the terms. In one embodiment, the weighting system also calculates the base for the logarithmic version of the normalizing term frequency function Γ. The base is calculated so that the upper limit of the effect of the normalizing term frequency function Γ is the average term frequency.

The weighting system also provides various techniques for generating a term frequency that is a more accurate representation of the importance of a term than the number of occurrences of the term within the document. Such generated term frequencies are referred to as improved term frequencies. When generating an improved term frequency, the weighting system takes into consideration factors such as formatting (e.g., italics) of the term within the document, the number of unique terms within the document, and whether the term is displayed when the document is displayed. The various improved term frequencies can be used in place of the term frequency when calculating the weight of a term.

FIG. 1 is a block diagram of a computer system for executing the weighting system. The computer system 110 includes a memory 111, central processing unit 112, and various I/O devices 113. The input/output devices may include a monitor, keyboard, and storage devices. The storage devices may include a disk, a CD-ROM, or other computer-readable media. The computer system 110 can access the corpus 100 of documents that may be contained on a storage device directly connected to the computer or on storage devices of other computer systems. The corpus contains documents 101 and each document comprises one or more terms 102. The weighting system 114 comprises a term frequency matrix 114a, a weighting component 114b, and a term-document weight matrix 114c. The term frequency matrix maps each term to the number of times that it occurs in each document. Thus, the number of rows in the term frequency matrix is equal to the number of unique terms (“M”) in the corpus, and the number of columns in the matrix is equal to the number of documents (“N”) in the corpus. The term frequency matrix can be pre-generated or generated by the weighting system by searching through the corpus. The weighting component determines the term-document importance (weight) of each term within each document. The weighting component inputs the term frequency matrix and generates the term-document weight matrix. The term-document weight matrix is the same size as the term frequency matrix and has a row for each unique term in the corpus and a column for each document in the corpus. The term-document weight matrix contains the term-document weight for each term within each document. One skilled in the art will appreciate that the matrices can be stored using various compaction techniques. For example, the term-document weight matrix could store only those weights above a certain threshold level. The document selector component 115 inputs a search term from the query component 116 and identifies those documents whose term-document weight for that search term is highest as indicated by the term-document weight matrix.
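The term frequency matrix described above can be sketched as follows. This is an illustrative sketch only: the function name `build_term_frequency_matrix`, the lowercase whitespace tokenization, and the dense list-of-lists storage are assumptions, since the description leaves tokenization and storage (including compaction) open.

```python
from collections import Counter

def build_term_frequency_matrix(documents):
    """Return (terms, matrix) with one row per unique term (M rows)
    and one column per document (N columns)."""
    # Assumption: terms are lowercase whitespace-separated tokens.
    counts = [Counter(doc.lower().split()) for doc in documents]
    terms = sorted(set().union(*counts))
    matrix = [[doc_counts[term] for doc_counts in counts] for term in terms]
    return terms, matrix

corpus = ["car for sale", "car car classified", "automobile sale"]
terms, tf_matrix = build_term_frequency_matrix(corpus)
for term, row in zip(terms, tf_matrix):
    print(term, row)
```

Each row gives the occurrences of one term across all documents, so the document frequency and the total term frequency of a term can both be read from a single row.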

FIG. 2 is a flow diagram of the weighting component. The weighting component selects each term and each document and calculates the weight for the selected term and document. The weighting component stores the calculated weights in the term-document weight matrix. The formulas used to calculate the weight are described below in detail. In steps 201-206, the weighting component loops selecting each unique term i in the corpus. In step 201, the component selects the next term starting with the first. In step 202, if all the terms have already been selected, then the component is done, else the component continues at step 203. In steps 203-206, the component loops selecting each document j for the selected term. In step 203, the component selects the next document starting with the first. In step 204, if all the documents have already been selected for the selected term, then the component loops to step 201 to select the next term, else the component continues at step 205. In step 205, the component calculates the weight for the selected term within the selected document. In step 206, the component stores the calculated weight in the term-document weight matrix indexed by the selected term and document. The component then loops to step 203 to select the next document for the selected term.

FIG. 3 is a flow diagram of a routine to calculate the weight of a term. This routine is passed an indication of the selected term and the selected document and returns the calculated weight. In step 301, the routine retrieves the term frequency TF_(ij) for the selected term i and the selected document j from the term frequency matrix. In step 302, the routine retrieves the number of documents n_(i) in which the selected term occurs. The number of documents can be calculated by counting the number of non-zero entries in the row corresponding to the selected term of the term frequency matrix. In step 303, the routine retrieves the number of occurrences of the term throughout the corpus. The number can be calculated by totaling the number of occurrences in each entry in the row for the selected term. In step 304, the routine calculates the weight to be a combination of the computed normalized term frequency function A and the computed inverse document frequency function B and then returns.
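The loops of FIG. 2 and the lookups of FIG. 3 might be sketched as follows. The names `term_statistics`, `build_weight_matrix`, and `placeholder_weight` are hypothetical, and `placeholder_weight` is only a stand-in for the combination of functions A and B so that the traversal can be run; it is not the weighting formula of the specification.

```python
def term_statistics(tf_matrix, i, j):
    """Steps 301-303: read TF_ij, n_i and the total term frequency from row i."""
    row = tf_matrix[i]
    tf_ij = row[j]                               # step 301
    n_i = sum(1 for count in row if count != 0)  # step 302: non-zero entries
    total_tf_i = sum(row)                        # step 303: row total
    return tf_ij, n_i, total_tf_i

def build_weight_matrix(tf_matrix, weight_fn):
    """Steps 201-206: visit every (term, document) pair and store its weight."""
    weights = []
    for i, row in enumerate(tf_matrix):          # steps 201-202: each term
        weight_row = []
        for j in range(len(row)):                # steps 203-204: each document
            tf_ij, n_i, total_tf_i = term_statistics(tf_matrix, i, j)
            weight_row.append(weight_fn(tf_ij, total_tf_i, n_i, len(row)))  # 205-206
        weights.append(weight_row)
    return weights

def placeholder_weight(tf_ij, total_tf_i, n_i, n_docs):
    # Hypothetical stand-in for step 304, used only to exercise the loops.
    if total_tf_i == 0:
        return 0.0
    return (tf_ij / total_tf_i) * (n_docs / n_i)

tf_matrix = [[1, 2, 0], [3, 3, 3]]
print(build_weight_matrix(tf_matrix, placeholder_weight))
```

Passing the weight function as a parameter keeps the traversal independent of whichever normalizing function and bases are chosen.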

FIG. 4 is a flow diagram of the document selector component. The document selector component receives a search term from a query and selects the documents whose weights are highest for that term. In one embodiment, this component generates a top 10 list of the documents with the highest weights. In step 401, the component receives the search term. In steps 402-406, the component loops selecting the documents and determining the weight for the search term for that document. In step 402, the component selects the next document starting with the first document. In step 403, if all the documents have already been selected, then the component completes, else the component continues at step 404. In step 404, the component retrieves the weight for the search term and the selected document from the term-document weight matrix. In step 405, if the retrieved weight is greater than the lowest weight in the current top 10 list, then the component continues at step 406, else the component loops to step 402 to select the next document. In step 406, the component adds the selected document to the top 10 list and loops to step 402 to select the next document.
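One way to sketch the selection loop of FIG. 4 is with a bounded min-heap. The use of `heapq`, the function name `select_top_documents`, and the rule that a newly added document displaces the current lowest entry once the list is full are assumptions; the description states only that a document is added when its weight exceeds the lowest weight in the list.

```python
import heapq

def select_top_documents(weight_row, k=10):
    """Return indices of the k documents with the highest weights,
    best first. weight_row is the search term's row of the weight matrix."""
    top = []  # min-heap of (weight, document index); top[0] is the lowest
    for doc_index, weight in enumerate(weight_row):      # steps 402-404
        if len(top) < k:
            heapq.heappush(top, (weight, doc_index))
        elif weight > top[0][0]:                          # step 405
            heapq.heapreplace(top, (weight, doc_index))   # step 406
    ranked = sorted(top, key=lambda pair: (-pair[0], pair[1]))
    return [doc_index for _, doc_index in ranked]

print(select_top_documents([0.1, 0.9, 0.4, 0.0, 0.7], k=3))
```

Keeping the lowest retained weight at the top of the heap makes the comparison in step 405 a constant-time check.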

Weighting Formula

The weighting system calculates the weight of a term for a document based on the term frequency and the inverse document frequency. The calculation of the weight W is represented by the following equation:

$$W_{ij} = \mathrm{CNTF}_{ij} * \mathrm{CIDF}_{i} \qquad (2.1)$$

where W_(ij) is the weight for term i and document j, where CNTF_(ij) is the computed normalized term frequency as described below for term i and document j, and where CIDF_(i) is the computed inverse document frequency as described below for term i.

The computed inverse document frequency is derived by applying the computed inverse document frequency function B to the inverse document frequency. In one embodiment, the computed inverse document frequency function B is a logarithmic function given by the following equation:

$$\mathrm{CIDF}_{i} = B\left(\frac{N}{n_{i}}\right) = \log_{\beta}\frac{N}{n_{i}} \qquad (2.2)$$

where N is the number of documents in the corpus, where n_(i) is the number of documents that contain the term i, and where β is the base of the logarithm.

The computed normalized term frequency is derived by applying the computed normalized term frequency function A to the normalized term frequency, which is the term frequency divided by the result of applying the normalizing term frequency function Γ to the total term frequency. In one embodiment, the computed normalized term frequency function A is the logarithmic function given by the following equation:

$$\mathrm{CNTF}_{ij} = A\left(\frac{\mathrm{TF}_{ij}}{\Gamma\left(\mathrm{TF}_{i}\right)}\right) = \log_{\alpha}\left((\alpha - 1) + \frac{\mathrm{TF}_{ij}}{\Gamma\left(\mathrm{TF}_{i}\right)}\right) \qquad (2.3)$$

where α is the base of the logarithm. The normalized term frequency NTF_(ij) for the term i within document j is given by the following equation:

$$\mathrm{NTF}_{ij} = \frac{\mathrm{TF}_{ij}}{\Gamma\left(\mathrm{TF}_{i}\right)} \qquad (2.4)$$

As discussed above, the normalizing term frequency function Γ is based on either the square root of the total term frequency or a logarithm of the total term frequency. A square root based normalizing term frequency function Γ is given by the following equation:

$$\Gamma\left(\mathrm{TF}_{i}\right) = \left(1 + \sum_{k=1}^{n_{i}} \mathrm{TF}_{ik}\right)^{1/2} \qquad (2.5)$$

One skilled in the art would appreciate that roots (i.e., powers between 0 and 1) other than the square root can be used (e.g., cube root). A logarithmic-based normalizing term frequency function is given by the following equation:

$$\Gamma\left(\mathrm{TF}_{i}\right) = \log_{\gamma}\left((\gamma - 1) + \sum_{k=1}^{n_{i}} \mathrm{TF}_{ik}\right) \qquad (2.6)$$

where γ is the base of the logarithm. The weighting system preferably uses a square root or logarithmic based normalizing term frequency function. Since the square root of a number is in general larger than the logarithm of a number (at least as the number approaches infinity), a square root function tends to lower the contribution of the term frequency to the weight as compared to a logarithmic function.
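Equations (2.1) through (2.6) can be combined into a short sketch. The function names and the example values (α = 2, β = 10, γ = 2, and the term counts) are assumptions chosen for illustration; the specification derives the bases from corpus averages rather than fixing them.

```python
import math

def gamma_sqrt(total_tf):
    return math.sqrt(1 + total_tf)                            # equation (2.5)

def gamma_log(total_tf, gamma_base=2.0):
    return math.log(gamma_base - 1 + total_tf, gamma_base)    # equation (2.6)

def cntf(tf_ij, total_tf, alpha, gamma_fn):
    ntf = tf_ij / gamma_fn(total_tf)                          # equation (2.4)
    return math.log(alpha - 1 + ntf, alpha)                   # equation (2.3)

def cidf(n_docs, n_i, beta):
    return math.log(n_docs / n_i, beta)                       # equation (2.2)

def weight(tf_ij, total_tf, n_i, n_docs, alpha, beta, gamma_fn):
    return cntf(tf_ij, total_tf, alpha, gamma_fn) * cidf(n_docs, n_i, beta)  # (2.1)

# Illustrative values: 5 occurrences in the document, 99 in the corpus,
# 10 of 1000 documents contain the term.
w_sqrt = weight(5, 99, 10, 1000, alpha=2.0, beta=10.0, gamma_fn=gamma_sqrt)
w_log = weight(5, 99, 10, 1000, alpha=2.0, beta=10.0, gamma_fn=gamma_log)
print(w_sqrt, w_log)
```

For these values the square root normalization yields the smaller weight, in line with the observation that it lowers the contribution of the term frequency more than the logarithmic normalization does.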

Calculation of Base β

The weighting system attempts to equalize the contribution of the computed normalized term frequency and the computed inverse document frequency to the calculation of the weight. The base β of the computed inverse document frequency function is selected based on the change in the number of documents in which a term occurs that is desired to change the value of the computed inverse document frequency by 1. For example, if a factor of 10 change in the number of documents is desired to change the computed inverse document frequency by 1, then base β should be 10. Assuming a corpus of 10 million documents and base β of 10, and assuming that a term occurs in 100 documents, then the computed inverse document frequency will be the logarithm base 10 of 10 million divided by 100, which is 5. However, if the term occurs in 1000 documents, then the computed inverse document frequency will be 4. If the corpus is relatively small (e.g., 1000 documents), then the base β may be selected to be small so that the computed inverse document frequency can vary over a large range. For example, if there are 1000 documents in a corpus, and base β is 10, then the computed inverse document frequency ranges only from 0 to 3. However, if the base β is 2, then the computed inverse document frequency ranges from 0 to about 10.
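The figures in this example can be reproduced with a few lines. The function name `computed_idf` is hypothetical; the calculation is simply the logarithm of N/n_(i) in base β.

```python
import math

def computed_idf(n_docs, n_i, beta):
    """Logarithm, in base beta, of the inverse document frequency N / n_i."""
    return math.log(n_docs / n_i, beta)

print(computed_idf(10_000_000, 100, 10))   # term in 100 of 10 million documents
print(computed_idf(10_000_000, 1000, 10))  # term in 1000 of 10 million documents
print(computed_idf(1000, 1, 10))           # upper end of the range, base 10
print(computed_idf(1000, 1, 2))            # upper end of the range, base 2
```

The last value is log2(1000), slightly below 10, which is why the range for base 2 is only approximately 0 to 10.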

Calculation of Base α

Once the base β is selected, the weighting system selects the base α so that the computed normalized term frequency and the computed inverse document frequency on average contribute equally to the weight. The average term frequency aTF_(j) for document j is given by the following equation:

$$\mathrm{aTF}_{j} = \frac{\sum_{i=1}^{M_{j}} \mathrm{TF}_{ij}}{M_{j}} \qquad (3.1)$$

where M_(j) is the number of terms in document j. The average term frequency aTF across all documents is given by the following equation:

$$\mathrm{aTF} = \frac{\sum_{j=1}^{N} \mathrm{aTF}_{j}}{N} \qquad (3.2)$$

The average number of documents in which a term occurs, an, is given by the following equation:

$$\mathrm{an} = \frac{\sum_{i=1}^{M} n_{i}}{M} \qquad (3.3)$$

where M is the number of terms in the corpus. The average computed inverse document frequency is thus given by the following equation:

$$\mathrm{aCIDF} = \log_{\beta}\frac{N}{\mathrm{an}} \qquad (3.4)$$

The average computed normalized term frequency is given by the following equation:

$$\mathrm{aCNTF} = \log_{\alpha}\left((\alpha - 1) + \frac{\mathrm{aTF}}{\Gamma\left(\mathrm{aTF}\right)}\right) \qquad (3.5)$$

The values of base α and base β are determined so that the average computed inverse document frequency equals the average computed normalized term frequency, as provided by the following equation:

$$\log_{\beta}\frac{N}{\mathrm{an}} = \log_{\alpha}\left((\alpha - 1) + \frac{\mathrm{aTF}}{\Gamma\left(\Sigma\mathrm{aTF}\right)}\right) \qquad (3.6)$$

where ΣaTF represents the total term frequency based on the average term frequency. Equation (3.6) can be solved for the various possible normalizing term frequency functions Γ.
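The averages in equations (3.1) through (3.4) can be computed directly from a term frequency matrix, as in the following sketch. The function names are hypothetical, and two interpretations are assumed: M_(j) is taken to be the number of distinct terms that occur in document j, and a document with no terms is given an average of 0.

```python
import math

def average_term_frequency(tf_matrix):
    """Equations (3.1) and (3.2); rows are terms, columns are documents."""
    n_docs = len(tf_matrix[0])
    per_document = []
    for j in range(n_docs):
        counts = [row[j] for row in tf_matrix if row[j] > 0]  # terms in document j
        per_document.append(sum(counts) / len(counts) if counts else 0.0)  # (3.1)
    return sum(per_document) / n_docs                                      # (3.2)

def average_document_frequency(tf_matrix):
    """Equation (3.3): mean number of documents in which a term occurs."""
    doc_freqs = [sum(1 for count in row if count > 0) for row in tf_matrix]
    return sum(doc_freqs) / len(doc_freqs)

def average_cidf(tf_matrix, beta):
    """Equation (3.4): log base beta of N / an."""
    n_docs = len(tf_matrix[0])
    return math.log(n_docs / average_document_frequency(tf_matrix), beta)

tf_matrix = [[1, 2, 0], [3, 3, 3], [0, 0, 4]]
print(average_term_frequency(tf_matrix))
print(average_document_frequency(tf_matrix))
print(average_cidf(tf_matrix, beta=2.0))
```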

Calculation of Base α for the Salton Buckley Formula

The Salton Buckley formula would use a normalizing term frequency function Γ given by the following equation:

$$\Gamma\left(\Sigma\mathrm{aTF}\right) = 1 \qquad (4.1)$$

The result of such an equation is that normalization does not occur. Inserting this function into the equation for the average computed normalized term frequency results in the following equation:

$$\mathrm{aCNTF} = \log_{\alpha}\left((\alpha - 1) + \mathrm{aTF}\right) \qquad (4.2)$$

This equation can be represented using logarithms of base β as given by the following equation:

$$\mathrm{aCNTF} = \frac{\log_{\beta}\left(1 + \mathrm{aTF}\right)}{\log_{\beta}\alpha} \qquad (4.3)$$

assuming that α<<aTF. When the average computed inverse document frequency is set equal to the average computed normalized term frequency, it results in the following equation:

$$\log_{\beta}\frac{N}{\mathrm{an}} = \frac{\log_{\beta}\left(1 + \mathrm{aTF}\right)}{\log_{\beta}\alpha} \qquad (4.4)$$

The solution for base α is given by the following equation:

$$\alpha = \left\lceil \beta^{\left(\frac{\log_{\beta}\left(1 + \mathrm{aTF}\right)}{\log_{\beta}\frac{N}{\mathrm{an}}}\right)} \right\rceil \qquad (4.5)$$
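Equation (4.5) can be evaluated as in the sketch below, which reads the parenthesized quotient as the exponent of β, as the derivation from equation (4.4) implies. The function name, the `round_up` flag, and the example averages are assumptions for illustration, and the sketch assumes N is greater than an so that the denominator is non-zero.

```python
import math

def alpha_salton_buckley(a_tf, a_n, n_docs, beta, round_up=True):
    """Equation (4.5): alpha = ceil(beta ** (log_b(1 + aTF) / log_b(N / an)))."""
    exponent = math.log(1 + a_tf, beta) / math.log(n_docs / a_n, beta)
    alpha = beta ** exponent
    return math.ceil(alpha) if round_up else alpha

# Illustrative averages: aTF = 4, an = 10, N = 1000, beta = 10.
alpha = alpha_salton_buckley(4, 10, 1000, 10)
exact = alpha_salton_buckley(4, 10, 1000, 10, round_up=False)
print(alpha, exact)
# Before rounding, log_alpha(1 + aTF) matches log_beta(N / an) as in (4.4).
print(math.log(1 + 4, exact), math.log(1000 / 10, 10))
```

Rounding the base up to an integer means the two averages are equal only approximately once the ceiling is applied.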

Calculation of Base α for the Square Root Function

When the normalizing term frequency function Γ is the square root function of equation (2.5), the solution for base α is given by the following equation:

$$\alpha = \left\lceil \beta^{\left(\frac{\log_{\beta}\left(\beta - 1 + \mathrm{aTF} * \left(1 + \mathrm{an} * \mathrm{aTF}\right)^{-1/2}\right)}{\log_{\beta}\frac{N}{\mathrm{an}}}\right)} \right\rceil \qquad (5)$$
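A sketch of equation (5) follows. It evaluates the equation term by term as printed, including the β − 1 term inside the logarithm, and treats the parenthesized quotient as the exponent of β. The function name and the example averages are assumptions.

```python
import math

def alpha_sqrt(a_tf, a_n, n_docs, beta, round_up=True):
    """Equation (5): base alpha for the square root normalizing function."""
    # Average normalized term frequency: aTF * (1 + an * aTF) ** (-1/2).
    a_ntf = a_tf * (1 + a_n * a_tf) ** -0.5
    exponent = math.log(beta - 1 + a_ntf, beta) / math.log(n_docs / a_n, beta)
    alpha = beta ** exponent
    return math.ceil(alpha) if round_up else alpha

# Illustrative averages: aTF = 4, an = 10, N = 1000, beta = 10.
print(alpha_sqrt(4, 10, 1000, 10))
print(alpha_sqrt(4, 10, 1000, 10, round_up=False))
```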

Calculation of Base α for the Logarithmic Function

When the normalizing term frequency function Γ is a logarithmic function, the solution for base α is given by the following equation:

$$\alpha = \left\lceil \beta^{\left(\frac{\log_{\beta}\left(\beta - 1 + \mathrm{aTF} * \left(\log_{\gamma}\left(\gamma - 1 + \mathrm{an} * \mathrm{aTF}\right)\right)^{-1}\right)}{\log_{\beta}\frac{N}{\mathrm{an}}}\right)} \right\rceil \qquad (6)$$

Calculation of Base γ for the Logarithmic Function

The weighting system sets the base γ so that the upper limit of the effect of the normalizing term frequency function Γ is equal to the average term frequency. In addition, the weighting system approximates the total term frequency for the term i as the product of the average term frequency for all terms times the average document frequency for all terms. This approximation is given by the following equation:

$$\sum_{k=1}^{n_{i}} \mathrm{TF}_{ik} \cong \mathrm{aTF} * \mathrm{an} \qquad (7.1)$$

The weighting system also assumes that the value of γ−1 is much less than the average total of the term frequency as represented by the following equation:

$$\gamma - 1 \ll \mathrm{aTF} * \mathrm{an} \qquad (7.2)$$

Setting the upper limit of the normalizing term frequency function equal to the average term frequency gives the following equation:

$$\log_{\gamma}\left(\mathrm{aTF} * \mathrm{an}\right) = \mathrm{aTF} \qquad (7.3)$$

The solution for the base γ is given by the following equation:

$$\gamma = \left\lceil \left(\mathrm{aTF} * \mathrm{an}\right)^{\frac{1}{\mathrm{aTF}}} \right\rceil \qquad (7.4)$$
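The step from equation (7.3) to equation (7.4) can be checked numerically. The function name `gamma_base`, the `round_up` flag, and the example averages are assumptions for this sketch.

```python
import math

def gamma_base(a_tf, a_n, round_up=True):
    """Equation (7.4): gamma = ceil((aTF * an) ** (1 / aTF))."""
    gamma = (a_tf * a_n) ** (1 / a_tf)
    return math.ceil(gamma) if round_up else gamma

# Illustrative averages: aTF = 4, an = 10.
exact = gamma_base(4, 10, round_up=False)
print(gamma_base(4, 10), exact)
# Equation (7.3): with the unrounded base, log_gamma(aTF * an) equals aTF.
print(math.log(4 * 10, exact))
```

Because the base is rounded up, the normalizing function evaluated at the approximated total term frequency ends up slightly below the average term frequency.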

FIG. 5 is a flow diagram of a routine to calculate the bases for the weighting function. In step 501, the routine selects the inverse document frequency ratio that will result in a change of 1 in the computed inverse document frequency. In one embodiment, this ratio is input to the weighting system. In step 502, the routine sets base β equal to the selected inverse document frequency ratio. In step 503, the routine calculates the average term frequency and the average document frequency using equations (3.2) and (3.3). In step 504, the routine sets base γ according to equation (7.4). In steps 506 and 507, the routine sets the base α according to equation (6).
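The routine of FIG. 5 might be sketched in Python as shown below. This is an illustrative sketch only: the function name compute_bases and its parameter names are hypothetical, the averages aTF and an are taken as inputs rather than computed from equations (3.2) and (3.3), and the sketch assumes that aTF*an is greater than 1 and that N is greater than an so that all logarithms are defined.

```python
import math


def compute_bases(beta, avg_tf, avg_df, num_docs):
    """Return (beta, gamma, alpha) for the logarithmic normalizing function.

    beta     -- selected inverse document frequency ratio (steps 501-502)
    avg_tf   -- average term frequency aTF (assumed precomputed, step 503)
    avg_df   -- average document frequency an (assumed precomputed, step 503)
    num_docs -- number of documents N
    """
    if avg_tf * avg_df <= 1 or num_docs <= avg_df:
        raise ValueError("sketch assumes aTF*an > 1 and N > an")

    # Step 504: base gamma per equation (7.4).
    gamma = math.ceil((avg_tf * avg_df) ** (1.0 / avg_tf))

    # Value of the normalizing function at the average total term frequency.
    norm = math.log(gamma - 1 + avg_df * avg_tf, gamma)

    # Base alpha per equation (6): beta raised to a ratio of base-beta logs.
    numerator = math.log(beta - 1 + avg_tf / norm, beta)
    denominator = math.log(num_docs / avg_df, beta)
    alpha = math.ceil(beta ** (numerator / denominator))

    return beta, gamma, alpha


bases = compute_bases(beta=2, avg_tf=3, avg_df=10, num_docs=1000)
```

Because both results are rounded up, floating-point error near an exact integer can shift a base by one; a production implementation would likely need a tolerance around the ceiling.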

In an alternate embodiment, the weighting system calculates the weight of term i within document j using the following equation: $\begin{matrix}{W_{ij} = \frac{\log_{2}\left( {{TF}_{ij} + 1} \right)}{\sum\limits_{k = 1}^{n_{i}}{\log_{2}\left( {{TF}_{ik} + 1} \right)}}} & (8)\end{matrix}$

This equation is an entropy measure of the term frequency. This entropy measure has the desirable characteristics that the weight ranges from 0 to 1; that the sum of the weights of a term across all documents is 1; that if a term occurs in only one document, then its weight is 1; and that if a term occurs in all documents an equal number of times (e.g., “the” or “a”), then the weight for each document is 1/N.
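The characteristics listed above can be checked with a short Python sketch of equation (8). The function name entropy_weights is hypothetical, and the input is assumed to be the list of term frequencies of one term in each of the documents that contain it.

```python
import math


def entropy_weights(term_freqs):
    """Weights W_ij of one term across the documents that contain it.

    term_freqs -- list of TF_ik values (each at least 1), one per document.
    """
    logs = [math.log2(tf + 1) for tf in term_freqs]
    total = sum(logs)
    return [value / total for value in logs]


single = entropy_weights([7])          # term occurs in only one document
uniform = entropy_weights([3, 3, 3, 3])  # equal count in each of 4 documents
mixed = entropy_weights([1, 5, 20])
```

In this sketch, single holds one weight of 1, each entry of uniform is 1/4, and the entries of mixed sum to 1.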

Improved Calculation of Term Frequency

The count of the number of occurrences of a term within a document may not accurately reflect the importance of that term to the document. For example, as discussed above, documents accessible through the Internet often will contain a certain term many times so that search engines that use only the number of occurrences of a term as an indication of the importance of the term will retrieve that document in response to a search for that term. In many cases, that term may have little to do with the subject matter of the document. In one embodiment, the weighting system calculates an improved term frequency as a more accurate measure of the importance of a term to a document. The improved term frequency can be used in any weighting formula such as those described above. When generating the improved term frequency for a term within a document, the weighting system factors in the following:

1. The number of occurrences of the term in the document (i.e., raw term frequency): TF_(Raw)

2. The structures within the document in which the term occurs (e.g., abstract): C_(Structure)

3. The formatting (e.g., bold, italics, highlight) of the term within the document: C_(Format)

4. The proximity of the term to the beginning of the document: C_(ClosenessToTop)

5. The distribution of the term among the sentences of the document: C_(Distribution)

6. The (inverse of) the number of unique terms in the document (i.e., vocabulary): C_(UniqueTermCount)

7. The (inverse of) the total number of occurrences of all terms within the document: C_(TotalTermCount)

8. The strength of false promoting (spamming) within the document: C_(SpamStrength)

The weighting system uses these factors when calculating the improved term frequency. In the following, an enhanced term frequency TF_(Enhanced) and an adjusted term frequency TF_(Adjusted) are described. Either the enhanced or the adjusted term frequency can be used as the improved term frequency.

When calculating the enhanced term frequency, the weighting system factors in the first five factors, which are characteristics of the term within the document independent of any other terms. For example, the format of the term is independent of any other terms in the document. The enhanced term frequency is defined by the following equations:

TF _(Enhanced) =TF _(Raw) *K2*(1+K1)  (9.1)

K1=Base*(C _(Structure) +C _(Format) +C _(ClosenessToTop))  (9.2)

K2=(1+C _(Distribution))  (9.3)

The factor Base is equal to the base γ of the normalizing term frequency function Γ. If the normalizing term frequency function is not logarithmic, then the factor Base is equal to 1. Thus, the factor Base compensates for whatever value of base γ is selected when a logarithmic function is used.
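Equations (9.1) through (9.3) might be computed as in the following Python sketch. The function and parameter names are hypothetical, and the constants are assumed to have been determined already by whatever document analysis the weighting system performs.

```python
def enhanced_tf(tf_raw, c_structure, c_format, c_closeness_to_top,
                c_distribution, base=1.0):
    """Enhanced term frequency per equations (9.1)-(9.3).

    base -- the factor Base: base gamma when the normalizing function is
            logarithmic, otherwise 1.
    """
    k1 = base * (c_structure + c_format + c_closeness_to_top)  # (9.2)
    k2 = 1.0 + c_distribution                                  # (9.3)
    return tf_raw * k2 * (1.0 + k1)                            # (9.1)


# All constants zero: the enhanced value equals the raw term frequency.
plain = enhanced_tf(4, 0, 0, 0, 0.0)

# Constants at the upper ends of the ranges given in this description
# (8, 2, 3, and 1) with Base equal to 1.
boosted = enhanced_tf(1, 8, 2, 3, 1.0, base=1.0)
```

With Base equal to 1, the second call evaluates 2*(1+13)*TF_(Raw), which is the largest multiplier that these ranges allow.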

When calculating the adjusted term frequency, the weighting system factors in the enhanced term frequency along with the last three factors, which are characteristics of other terms within the document. For example, the number of unique terms within a document is based on all terms within the document. The adjusted term frequency is defined by the following equations:

TF _(Adjusted) =TF _(Enhanced) *K3*K4  (10.1)

$\begin{matrix}{{K3} = \frac{1}{{Log}_{2}\left( {1 + \frac{C_{TotalTermCount}}{C_{UniqueTermCount}}} \right)}} & (10.2)\end{matrix}$

 K4=(1−C _(SpamStrength))  (10.3)

The ratio of total term count to the size of the vocabulary within a document is the average term frequency. As the average term frequency increases, the significance of term frequency to the term-document importance decreases. That is, if the average term frequency is close to 1 (which is the minimum possible), then the term frequency should be a significant factor in the term-document importance. In contrast, if the average term frequency is very large, then the term frequency should not be a very significant factor in the term-document importance. Factor K3 represents the significance of the average term frequency as the inverse of its logarithm base 2. Thus, factor K3 varies between 0 and 1. If the average term frequency is 1, then factor K3 will be 1. As the average term frequency approaches infinity, the factor K3 approaches 0.
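The behavior of factors K3 and K4 can be illustrated with a Python sketch of equations (10.1) through (10.3). The function and parameter names are hypothetical; the counts are assumed to be positive, with the total term count at least as large as the unique term count.

```python
import math


def adjusted_tf(tf_enhanced, total_term_count, unique_term_count,
                spam_strength):
    """Adjusted term frequency per equations (10.1)-(10.3)."""
    avg_tf = total_term_count / unique_term_count  # average term frequency
    k3 = 1.0 / math.log2(1.0 + avg_tf)             # (10.2)
    k4 = 1.0 - spam_strength                       # (10.3)
    return tf_enhanced * k3 * k4                   # (10.1)


# Average term frequency of 1 and no spamming: K3 = K4 = 1.
unchanged = adjusted_tf(10.0, 50, 50, 0.0)

# Average term frequency of 3 (K3 = 1/2) and spam strength 0.5 (K4 = 1/2).
reduced = adjusted_tf(10.0, 150, 50, 0.5)
```

In the first call the enhanced term frequency passes through unchanged; in the second it is scaled by one quarter.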

The constants C_(x) that are used to calculate improved term frequency are described in the following. One skilled in the art will appreciate that various different definitions for the constants can be used as empirical evidence suggests different contributions of the various factors. Also, the definition of the enhanced and adjusted term frequency can be modified to factor in the empirical evidence.

C_(Structure)

The constant C_(Structure) is defined for the structures of a hypertext markup language (“HTML”) document. The table below gives sample values for the constant C_(Structure) based on presence of the term in the URL for the document, the title of the document, the keyword section of the document, and the content of the document. The absence or presence in a structure is indicated by a “0” or “1.” For example, when the term is present in the URL, keyword section, and content, but not in the title, the value of the constant C_(Structure) is 6. The value of constant C_(Structure) varies between 0 and 8. One skilled in the art will appreciate that various other structures (e.g., abstract) of a document may be factored in when determining the value for constant C_(Structure). Also, different weighting for the various structures can be used.

URL   TITLE   KEYWORD   CONTENT   C_(Structure)
0     0       0         1         0
0     0       1         0         1
0     0       1         1         3
0     1       0         0         2
0     1       0         1         3
0     1       1         0         3
0     1       1         1         5
1     0       0         0         3
1     0       0         1         4
1     0       1         0         4
1     0       1         1         6
1     1       0         0         5
1     1       0         1         6
1     1       1         0         6
1     1       1         1         8

C_(Format)

The constant C_(Format) represents any emphasis (e.g., bold or italics) placed on the term within the document. The value of the constant C_(Format) varies between 0 and 2. For example, if the term is italicized at every occurrence, then the value of the constant C_(Format) may be 2. In contrast, if the term is italicized in 10% of its occurrences, then the value of the constant C_(Format) may be 1.

C_(ClosenessToTop)

The constant C_(ClosenessToTop) represents the proximity of the term to the beginning of the document. The constant C_(ClosenessToTop) varies between 0 and 3. For example, if the term occurs in every one of the first 5% of the sentences in the document, then the value for constant C_(ClosenessToTop) is 3. If, however, the term is not in any of the first 5% of the sentences, then the value of the constant C_(ClosenessToTop) is 0.

C_(SpamStrength)

The constant C_(SpamStrength) represents the strength of spamming, if any, that is detected in the document. The constant C_(SpamStrength) varies between 0 and 1.

C_(Distribution)

The constant C_(Distribution) represents the fraction of the sentences of the document in which the term occurs. The constant C_(Distribution) is the total number of sentences in which the term occurs divided by the total number of sentences in the document. The constant C_(Distribution) varies between 1/S and 1, where S is the number of sentences in the document.
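As a small illustration of this definition, the constant could be computed as in the following Python sketch. The function name and the input representation (a document already split into sentences, each a list of terms) are assumptions made for the example; sentence splitting itself is not addressed.

```python
def distribution_constant(sentences, term):
    """Fraction of the sentences of a document in which the term occurs.

    sentences -- list of sentences, each a list of terms (assumed input form)
    term      -- the term whose distribution is measured
    """
    containing = sum(1 for sentence in sentences if term in sentence)
    return containing / len(sentences)


document = [
    ["term", "weighting", "system"],
    ["weighting", "formula"],
    ["document", "frequency"],
    ["weighting", "of", "terms"],
]
c_distribution = distribution_constant(document, "weighting")
```

Here the term occurs in three of the four sentences, so the constant is 3/4; a term confined to a single sentence would give 1/S.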

C_(UniqueTermCount)

The constant C_(UniqueTermCount) represents the number of unique terms in the document. The constant C_(UniqueTermCount) varies between 1 and the number of occurrences of terms in the document.

C_(TotalTermCount)

The constant C_(TotalTermCount) represents the total number of terms in the document.

The following table represents the minimum and maximum values for the raw, enhanced, and adjusted term frequencies.

TF              MIN               MAX
TF_(Raw)        1                 ∞
TF_(Enhanced)   TF_(Raw)          2*(1 + Base*13)*TF_(Raw)
TF_(Adjusted)   0*TF_(Enhanced)   TF_(Enhanced)

The enhanced term frequency has a minimum value of the raw term frequency when the constants C_(Structure), C_(Format), C_(ClosenessToTop), and C_(Distribution) each have a value of 0 and a maximum value when each of these constants is at its maximum value. The maximum value is 2*(1+Base*13)*TF_(Raw). The adjusted term frequency has its minimum value of 0 when the constant C_(SpamStrength) is at its maximum value of 1, so that factor K4 is 0. For a spam strength below 1, the adjusted term frequency decreases as the constant C_(TotalTermCount) grows relative to the constant C_(UniqueTermCount). The adjusted term frequency has its maximum value of TF_(Enhanced) when the constant C_(SpamStrength) is at its minimum value of 0 and the constants C_(TotalTermCount) and C_(UniqueTermCount) are equal, so that factors K3 and K4 are both 1.

In an alternate embodiment, the weighting system may use a simple term frequency or a complete term frequency as the improved term frequency. The simple term frequency of a term is the number of times that the term is displayed when the document is displayed. Thus, the simple term frequency represents the occurrences of the term in visible portions of the document. In HTML documents, for example, the visible portions of the document do not include comments, metadata fields, formatting tags, and so on. The complete term frequency of a term is the number of times that the term occurs anywhere in the document. In some calculations of the complete term frequency, the occurrences within HTML formatting tags are excluded.

In one embodiment, the weighting system stores with each document those terms of the document with the highest weights along with their weights. These terms are referred to as “smart words.” For example, the weighting system may store up to 30 smart words in order based on their weights in a string-valued property of a document, referred to as the “SmartKeyWords” property. The weighting system may also store the corresponding weights of the smart words in the same order in another string-valued property, referred to as the “SmartKeyWordWeights” property. The weighting system limits smart words to those terms that occur in at least a minimum number of documents, for example, 10. This minimum number may be established based on the number of documents displayed as a result of a search. The minimum number helps prevent misspelled words from being selected as smart words.
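The selection of smart words might look like the following Python sketch. The function name, the dictionary inputs, the comma delimiter, and the four-decimal weight format are all assumptions made for illustration; the description specifies only that the two properties are string-valued and hold the words and weights in the same order.

```python
def select_smart_words(weights, doc_freq, max_words=30, min_docs=10):
    """Build the SmartKeyWords and SmartKeyWordWeights strings for a document.

    weights  -- dict mapping each term of the document to its weight
    doc_freq -- dict mapping each term to the number of documents containing it
    """
    # Keep only terms that occur in at least min_docs documents.
    eligible = [(term, weight) for term, weight in weights.items()
                if doc_freq.get(term, 0) >= min_docs]
    # Highest weights first, limited to max_words entries.
    eligible.sort(key=lambda pair: pair[1], reverse=True)
    top = eligible[:max_words]
    smart_key_words = ",".join(term for term, _ in top)
    smart_key_word_weights = ",".join(f"{weight:.4f}" for _, weight in top)
    return smart_key_words, smart_key_word_weights


weights = {"search": 0.9, "serch": 0.95, "index": 0.5}
doc_freq = {"search": 120, "serch": 1, "index": 40}
words, word_weights = select_smart_words(weights, doc_freq)
```

In this example the misspelling "serch" has the highest weight but occurs in only one document, so it is filtered out by the minimum document count.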

Although the present invention has been described in terms of various embodiments, it is not intended that the invention be limited to these embodiments. Modifications within the spirit of the invention will be apparent to those skilled in the art. The scope of the present invention is defined by the claims that follow.

What is claimed is:
 1. A method in a computer system for generating a weight for a term within one of a plurality of documents, the method comprising: generating a term frequency that represents a number of times that the term occurs in the one document; generating a total term frequency that represents a total number of times the term occurs in the plurality of documents; calculating a normalized term frequency by factoring the generated term frequency by a normalizing function of the generated total term frequency, wherein the normalizing function substantially equalizes result of term frequency and reciprocal total term frequency on the weight of the term; and combining the calculated normalized term frequency with a document frequency to generate the weight for the term.
 2. The method of claim 1 wherein the normalizing function comprises a reciprocal of a root of the generated total term frequency.
 3. The method of claim 2 wherein the root is a square root.
 4. The method of claim 2 wherein the root is a cube root.
 5. The method of claim 1 wherein the normalizing function comprises a reciprocal of a power of the generated total term frequency.
 6. The method of claim 5 wherein the power is between 0 and 1.
 7. The method of claim 1 wherein the normalizing function comprises a reciprocal of a logarithm of the generated total term frequency.
 8. The method of claim 1 wherein the combining of the calculated normalized term frequency with a document frequency multiplies a logarithm of the calculated normalized term frequency by a logarithm of the document frequency.
 9. The method of claim 8 wherein bases of the logarithms are different.
 10. The method of claim 8 wherein bases of the logarithms are calculated so that on average the logarithms of the calculated normalized term frequency and the logarithms of the document frequency are equal.
 11. The method of claim 8 wherein $\alpha = \left\lceil \beta^{(\frac{\log_{\beta}{({1 + {aTF}})}}{\log_{\beta}\frac{N}{an}})} \right\rceil$

where α is the base of the logarithm of the calculated normalized term frequency, where β is the base of the logarithm of the document frequency, where aTF is an average of the term frequencies for each term, where N is the number of documents, and where an is an average of the number of documents in which each term is contained.
 12. The method of claim 8 wherein $\alpha = \left\lceil \beta^{(\frac{\log_{\beta}{({\beta - 1 + {{aTF}*{({{an}*{aTF}})}^{{- 1}/2}}})}}{\log_{\beta}\frac{N}{an}})} \right\rceil$

where α is the base of the logarithm of the calculated normalized term frequency, where β is the base of the logarithm of the document frequency, where aTF is an average of the term frequencies for each term, where N is the number of documents, and where an is an average of the number of documents in which each term is contained.
 13. The method of claim 8 wherein $\alpha = \left\lceil \beta^{(\frac{\log_{\beta}{({\beta - 1 + {{aTF}*{\log_{\gamma}{({\gamma - 1 + {{an}*{aTF}}})}}^{- 1}}})}}{\log_{\beta}\frac{N}{an}})} \right\rceil$

where α is the base of the logarithm of the calculated normalized term frequency, where β is the base of the logarithm of the document frequency, where γ is the base of the logarithm for normalizing the term frequency, where aTF is an average of the term frequencies for each term, where N is the number of documents, and where an is an average of the number of documents in which each term is contained.
 14. The method of claim 1 wherein the generated term frequency is an improved term frequency.
 15. The method of claim 1 wherein the generated term frequency is enhanced based on factors that are independent of other terms in the document.
 16. The method of claim 1 wherein the generated term frequency is adjusted based on factors related to the term and factors related to other terms in the document.
 17. A computer-readable medium containing computer-readable instructions for performing the method of claim 1.
 18. A method in a computer system for selecting a formula for weighting terms within one of a plurality of documents, the method comprising: generating an average term frequency that represents an average of term frequencies for at least one of a plurality of terms within each document, the term frequency being the number of times that a term occurs in a document; generating an average inverse document frequency that represents an average of inverse document frequencies for the at least one of a plurality of terms, the inverse document frequency of the term being the number of documents divided by the number of documents in which the term occurs; and selecting a first function and a second function so that the result of the first function of the generated average term frequency is approximately equal to the result of the second function of the generated average inverse document frequency.
 19. The method of claim 18 wherein the first and second functions are logarithmic functions and the identifying of the functions includes calculating a first base for a logarithm and a second base for a logarithm so that the logarithm of the first base of the generated average term frequency is approximately equal to the logarithm of the second base of the generated average inverse document frequency.
 20. The method of claim 19 wherein the calculating of the second base calculates the second base to be equal to a multiplication factor by which a change in the generated average inverse document frequency results in a change of one in the logarithm of the second base of the generated average inverse document frequency.
 21. The method of claim 19 wherein $\alpha = \left\lceil \beta^{(\frac{\log_{\beta}{({1 + {aTF}})}}{\log_{\beta}\frac{N}{an}})} \right\rceil$

where α is the first base, where β is the second base, where aTF is an average of the term frequencies for each term, where N is the number of documents, and where an is an average of the number of documents in which each term is contained.
 22. The method of claim 19 wherein $\alpha = \left\lceil \beta^{(\frac{\log_{\beta}{({\beta - 1 + {{aTF}*{({{an}*{aTF}})}^{{- 1}/2}}})}}{\log_{\beta}\frac{N}{an}})} \right\rceil$

where α is the first base, where β is the second base, where aTF is an average of the term frequencies for each term, where N is the number of documents, and where an is an average of the number of documents in which each term is contained.
 23. The method of claim 19 wherein $\alpha = \left\lceil \beta^{(\frac{\log_{\beta}{({\beta - 1 + {{aTF}*{\log_{\gamma}{({\gamma - 1 + {{an}*{aTF}}})}}^{- 1}}})}}{\log_{\beta}\frac{N}{an}})} \right\rceil$

where α is the first base, where β is the second base, where γ is the base of the logarithm for normalizing the term frequency, where aTF is an average of the term frequencies for each term, where N is the number of documents, and where an is an average of the number of documents in which each term is contained.
 24. The method of claim 18 wherein the generated term frequency is an improved term frequency.
 25. The method of claim 18 wherein the generated term frequency is enhanced based on factors that are independent of other terms in the document.
 26. The method of claim 18 wherein the generated term frequency is adjusted based on factors related to the term and factors related to other terms in the document.
 27. A system for generating a weight for a term within one of a plurality of documents, the system comprising: a first term frequency generator that computes a number of times that the term occurs in the one of the plurality of documents; a second term frequency generator that computes a total number of times the term occurs in the plurality of documents; a normalizer that calculates a normalized term frequency by factoring the number of times that the term occurs in the one of the plurality of documents by a normalizing function of the total number of times the term occurs in the plurality of documents, the normalizing function substantially equalizing the result of term frequency and reciprocal total term frequency on the weight of the term; and a combiner that combines the calculated normalized term frequency with a document frequency to generate the weight for the term.